ABSTRACT

We present a new approach to evaluating chord recognition systems on
songs that do not have full annotations. The principle is to use online
chord databases to generate highly accurate "pseudo annotations" for
these songs and to compute "pseudo accuracies" of test systems against
them. Statistical models that capture the relationship between pseudo
accuracy and real performance are then applied to estimate the test
systems' performance. The approach goes beyond existing evaluation
metrics, allowing extensive analysis of chord recognition systems, such
as how well they generalize to different genres. In our experiments we
applied this method to three state-of-the-art chord recognition systems,
and the results verified its reliability.

1. INTRODUCTION 

In recent years, audio chord recognition has become a very active
field. The increasing popularity of Music Information Retrieval (MIR),
with applications that use mid-level tonal features, has established
chord recognition as a useful and challenging task.

Generally speaking, chord recognition is the task of automatically
detecting chord labels and boundaries from the audio of a musical piece.
The process involves segmenting a song into a sequence of windows at
high time resolution (known as frames), after which machine learning
techniques (e.g. Hidden Markov Models) are used to detect the chord
label of each frame, based on the extracted features and the local
context. If the ground truth annotation of the song is available, the
chord predictions can then be evaluated via frame-wise accuracy.
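
As an illustration of frame-wise accuracy, the following minimal Python sketch assumes that the prediction and the annotation have already been sampled on the same frame grid and that chord labels are plain strings. The function name `framewise_accuracy` and the toy labels are hypothetical and not taken from any of the systems discussed.

```python
from typing import Sequence


def framewise_accuracy(predicted: Sequence[str], reference: Sequence[str]) -> float:
    """Fraction of frames whose predicted chord label equals the reference label."""
    if len(predicted) != len(reference):
        raise ValueError("both sequences must cover the same frames")
    if not reference:
        raise ValueError("at least one frame is required")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


# Toy example: six frames, one disagreement (frame 4).
reference = ["C", "C", "G", "G", "Am", "Am"]
predicted = ["C", "C", "G", "Am", "Am", "Am"]
accuracy = framewise_accuracy(predicted, reference)
```

Passing a ground truth annotation as `reference` gives the usual accuracy; passing a pseudo annotation instead would give what the abstract calls a pseudo accuracy.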

The annual MIREX (Music Information Retrieval Evaluation eXchange)
competition has a task dedicated to chord recognition, where
participants attempt to generate chord predictions for a collection of
songs. In the most recent competitions, the dataset used is a collection
of Beatles, Queen and Zweieck songs for which ground truth annotations
are available. Due to the limited amount of data, existing chord
recognition systems (referred to as test systems in this paper) are
usually trained and tested on the same songs, inevitably causing
over-fitting to this dataset. Meanwhile, the evaluation is also heavily
constrained by the limited variety of the data. For example, most of the
songs in the dataset are from the Rock genre, so the reported
performance may not generalize to other genres.

To resolve these problems, the simplest, but most costly and least
scalable, solution would be to obtain more fully annotated data by
paying trained musicians to annotate new songs. As an alternative, we
propose a methodologically more challenging but cheaper and more
scalable approach: meta-song evaluation, which makes use of large and
freely available online chord databases, such as E-chords, to help
evaluate test systems. The principle is to automatically generate chord
annotations for new songs whose chord sequences are available in these
databases. The songs and the generated annotations are then used to
estimate the test systems' performance by means of statistical models.

However, chord sequences from such databases are generally less directly
usable than those produced by musicians, since the exact timings of
chords are absent and the chord sequences are sometimes affected by
various types of errors and omissions. Hence a system, referred to as
the reference system in this paper, is required to generate highly
accurate annotations from these untimed chord sequences. As demonstrated
in our previous work [6], we have designed a variety of reference
systems whose generated annotations are more accurate than those of most
existing chord recognition systems. We refer to these highly accurate,
but not perfect, annotations as pseudo annotations. In the rest of this
paper, we show how to make use of these pseudo annotations to
comprehensively evaluate the performance of different test systems.

2. MATHEMATICAL FRAMEWORK 

We use $y_i^A$ and $x_i^A$ to denote the ground truth (GT) accuracy and
the pseudo accuracy (i.e. the accuracy of the system's prediction
measured against the pseudo annotation) of system $A$'s chord prediction
for the $i$-th song. For each system we then obtain two sets of data: a
validation set $\{x_i^A, y_i^A\}_{i=1}^{n}$ and a test set
$\{x_i^A\}_{i=n+1}^{n+m}$. Note that ground truth annotations are
available only on the validation set, so that generally $m \gg n$. The
pool of test systems is denoted by $\mathcal{A}$, with $A \in \mathcal{A}$.

One observation from the validation set is that the pseudo accuracies
are highly correlated with the GT accuracies (see Figure 1), as long as
the pseudo annotations are accurate enough. In the ideal case, if all
pseudo annotations are 100% accurate, the pseudo accuracies coincide
with the GT accuracies. Inspired by this observation, we propose three
mathematical models of the relationship between GT and pseudo accuracies
on the validation set, which can then be applied to estimate GT
accuracies on the test set.

Figure 1. The relationship between the pseudo accuracies and the real GT
accuracies on 175 Beatles songs. The three systems are: A. HP,
B. labROSA and C. Chordino. The pseudo annotations of the songs are
generated by the Jump Alignment method [6], which has been shown to
produce more accurate chord predictions than all other systems. Note
that there are some outliers (the circled points in the top panel) for
which the online chord sequences are less informative (e.g. the chord
sequence only records the solo of the song). In these cases, the
resulting pseudo accuracies are not well correlated with the GT
accuracies. How to reduce or eliminate these outliers will be
investigated in future work.

2.1 Single Gaussian model 

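This model assumes that the pairs $(x_i^A, y_i^A)$ generated by all
systems $A \in \mathcal{A}$ are sampled i.i.d., with the difference
$y_i^A - x_i^A$ following a single Gaussian distribution
$\mathcal{N}(\mu, \sigma^2)$. That is,

$$y_i^A = x_i^A + \mu + \epsilon_i, \quad 1 \le i \le n, \quad \epsilon_i \sim \mathcal{N}(0, \sigma^2), \quad \forall A \in \mathcal{A}. \tag{1}$$

The parameter $\mu$ of the distribution can then be estimated by least
squares, resulting in

$$\hat{\mu} = \frac{1}{|\mathcal{A}|\, n} \sum_{A \in \mathcal{A}} \sum_{i=1}^{n} \left( y_i^A - x_i^A \right).$$

The unbiased estimator of $\sigma^2$ can be calculated by the formula

$$\hat{\sigma}^2 = \frac{1}{|\mathcal{A}|\, n - 1} \sum_{A \in \mathcal{A}} \sum_{i=1}^{n} \left( y_i^A - x_i^A - \hat{\mu} \right)^2.$$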
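
The two estimators of the single Gaussian model can be sketched in a few lines of Python. This is a minimal illustration, assuming the validation data are stored as a dictionary mapping each system name to a list of (pseudo accuracy, GT accuracy) pairs; the function name `fit_single_gaussian` and the toy numbers are hypothetical.

```python
from typing import Dict, List, Tuple


def fit_single_gaussian(
    pairs_by_system: Dict[str, List[Tuple[float, float]]]
) -> Tuple[float, float]:
    """Pool y - x over all systems and return (mu_hat, sigma2_hat)."""
    diffs = [y - x for pairs in pairs_by_system.values() for x, y in pairs]
    total = len(diffs)  # equals |A| * n when every system has n songs
    if total < 2:
        raise ValueError("at least two (x, y) pairs are required")
    mu_hat = sum(diffs) / total
    # Unbiased variance estimate: divide by (|A| * n - 1).
    sigma2_hat = sum((d - mu_hat) ** 2 for d in diffs) / (total - 1)
    return mu_hat, sigma2_hat


# Toy validation data: two systems, two songs each, as (x, y) pairs.
validation = {
    "A": [(0.80, 0.82), (0.70, 0.74)],
    "B": [(0.60, 0.61), (0.50, 0.55)],
}
mu_hat, sigma2_hat = fit_single_gaussian(validation)
```

Because the differences are pooled, a system whose pseudo accuracies are systematically inflated shifts the shared estimate for every system.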

Using the parameters $(\hat{\mu}, \hat{\sigma}^2)$ estimated from the
validation set, we are able to predict the GT accuracies
$\{y_i^A\}_{i=n+1}^{n+m}$ on the test set using linear regression
theory. Let $\bar{x} = [\ldots]$; then the following Gaussian
distribution holds for all test examples:

    [...]

Therefore, with probability $1 - \alpha$ the confidence interval of
$y_i^A$ is

    [...]

where $Q$ denotes the normal quantile function
$Q(p) = \inf\{ y \in \mathbb{R} : p \le \Pr(Y \le y) \}$.
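
The quantile function $Q$ is available in the Python standard library, so a Gaussian interval of the form "centre ± Q(p) · standard deviation" can be sketched as below. The helper names `normal_quantile` and `gaussian_interval` and the example numbers are hypothetical. The quantile level `p` is left as an explicit argument: the text writes $Q(1 - \alpha)$, whereas a symmetric two-sided interval with coverage $1 - \alpha$ conventionally uses $Q(1 - \alpha/2)$.

```python
from statistics import NormalDist
from typing import Tuple


def normal_quantile(p: float) -> float:
    """Q(p) for a standard normal Y: the smallest y with p <= Pr(Y <= y)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return NormalDist().inv_cdf(p)


def gaussian_interval(centre: float, std: float, p: float) -> Tuple[float, float]:
    """Symmetric interval centre +/- Q(p) * std."""
    half_width = normal_quantile(p) * std
    return centre - half_width, centre + half_width


# Example: an estimate of 0.80 with standard deviation 0.05, using Q(0.975).
low, high = gaussian_interval(0.80, 0.05, 0.975)
```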

We then extend Eq. ([...]) and estimate the mean accuracy

    [...]

[...] student-t distribution instead of a Gaussian. But since $n$ is
large enough ($n > 100$) in our case, we approximate the student-t
distribution by a Gaussian.

$$[\ldots]\ \hat{\mu} \pm Q(1 - \alpha)\, \sigma_A.$$

Apart from estimating the confidence interval of the GT accuracies, the
Gaussian distribution also allows us to compare two systems $A$ and $B$,
by estimating the confidence interval of $\bar{y}^A - \bar{y}^B$ using

    [...]

To compare the two systems $A$ and $B$, we now have a term
$\mu_A - \mu_B$ to reduce the effect of the biases, yielding

    [...]

We derive the following confidence interval with probability
$1 - \alpha$

    [...]

This yields the following confidence interval with probability
$1 - \alpha$

    [...]

The advantage of the single Gaussian model is that it uses
$|\mathcal{A}|$ times as much data to estimate the Gaussian parameters,
which is expected to give more robust estimates. However, as we observed
in the experiments, test systems that are closer to the reference system
generally obtained higher pseudo accuracies than the others. In this
case, the GT accuracies estimated by the single Gaussian model are
biased in favour of these systems.

2.2 Individual Gaussian model 

To reduce or eliminate the effect of such biases, we propose a variant
of the single Gaussian model that fits an individual Gaussian to each
test system. Mathematically, the GT accuracy $y_i^A$ is now modelled as

$$y_i^A = x_i^A + \mu_A + \epsilon_i, \quad 1 \le i \le n, \quad \epsilon_i \sim \mathcal{N}(0, \sigma_A^2), \quad \forall A \in \mathcal{A}, \tag{6}$$

where the parameters $(\mu_A, \sigma_A^2)$ can be learnt from the
validation data $\{x_i^A, y_i^A\}_{i=1}^{n}$:

    [...]

Here we denote $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i^A$ and
$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i^A - \bar{x})^2$; then,
following the same procedure as described in Section 2.1, we obtain

    [...]

and with probability $1 - \alpha$ the confidence interval of $y_i^A$ is

    [...]
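
A per-system fit for the individual Gaussian model can be sketched as follows. It assumes that the estimators mirror those of Section 2.1 applied to one system at a time (sample mean of $y_i^A - x_i^A$ and sample variance with an $n - 1$ denominator); that choice, the function name `fit_individual_gaussians` and the toy data are assumptions of this sketch rather than formulas taken from the text.

```python
from statistics import mean, variance
from typing import Dict, List, Tuple


def fit_individual_gaussians(
    pairs_by_system: Dict[str, List[Tuple[float, float]]]
) -> Dict[str, Tuple[float, float]]:
    """Return {system: (mu_hat_A, sigma2_hat_A)} from (x, y) validation pairs."""
    params = {}
    for system, pairs in pairs_by_system.items():
        diffs = [y - x for x, y in pairs]
        # statistics.variance is the sample variance (n - 1 denominator).
        params[system] = (mean(diffs), variance(diffs))
    return params


# Toy data: system "A" is under-estimated by its pseudo accuracies,
# system "B" is over-estimated.
validation = {
    "A": [(0.80, 0.82), (0.70, 0.74), (0.75, 0.78)],
    "B": [(0.60, 0.55), (0.50, 0.47), (0.55, 0.51)],
}
params = fit_individual_gaussians(validation)
```

In the toy data the two systems receive offsets of opposite sign, which a single pooled offset could not represent.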

2.3 Linear regression model

Apart from applying a different $\mu_A$ to each system to eliminate the
biases, one can also learn the slope of the regression line to better
fit the validation samples. Mathematically, the relationship between
$y_i^A$ and $x_i^A$ is now formulated as

$$y_i^A = a_A x_i^A + b_A + \epsilon_i, \quad 1 \le i \le n, \quad \epsilon_i \sim \mathcal{N}(0, \sigma_A^2), \quad \forall A \in \mathcal{A}, \tag{10}$$

where the parameters $(a_A, b_A, \sigma_A^2)$ are estimated by least
squares for every $A \in \mathcal{A}$.
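
Since Eq. (10) is a simple linear regression per system, its least-squares fit can be sketched with the textbook closed-form solution. The sketch below uses the residual variance with an $n - 2$ denominator, which is the usual unbiased choice when two coefficients are fitted but is an assumption here; the function name `fit_line` and the toy data are hypothetical.

```python
from typing import List, Tuple


def fit_line(pairs: List[Tuple[float, float]]) -> Tuple[float, float, float]:
    """Ordinary least squares for y = a * x + b; returns (a_hat, b_hat, sigma2_hat)."""
    n = len(pairs)
    if n < 3:
        raise ValueError("at least three (x, y) pairs are required")
    x_bar = sum(x for x, _ in pairs) / n
    y_bar = sum(y for _, y in pairs) / n
    sxx = sum((x - x_bar) ** 2 for x, _ in pairs)
    if sxx == 0.0:
        raise ValueError("all x values are identical; the slope is undefined")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in pairs)
    a_hat = sxy / sxx
    b_hat = y_bar - a_hat * x_bar
    # Residual variance with n - 2 degrees of freedom (assumed convention).
    rss = sum((y - a_hat * x - b_hat) ** 2 for x, y in pairs)
    return a_hat, b_hat, rss / (n - 2)


# Toy validation pairs lying exactly on y = 0.9 * x + 0.1.
a_hat, b_hat, sigma2_hat = fit_line(
    [(0.5, 0.55), (0.6, 0.64), (0.7, 0.73), (0.8, 0.82)]
)
```

Fixing the slope at 1 and fitting only the intercept recovers the individual Gaussian model of Section 2.2.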

Given the parameters, a test sample $y_i^A$ follows the following
Gaussian distribution

    [...]

and its confidence interval is of the form

$$y_i^A = \hat{a}_A x_i^A + \hat{b}_A \pm Q(1 - \alpha)\, [\ldots]$$

Analogously to Section 2.2, the mean accuracy $\bar{y}^A$ satisfies
$\bar{y}^A - \hat{a}_A \bar{x}^A - \hat{b}_A \sim \mathcal{N}(0, \hat{\sigma}_A^2)$,
and with probability $1 - \alpha$ the confidence interval of
$\bar{y}^A$ is

$$\bar{y}^A = \hat{a}_A \bar{x}^A + \hat{b}_A \pm Q(1 - \alpha)\, \hat{\sigma}_A. \tag{12}$$

To compare the two systems $A$ and $B$, the confidence interval of
$\bar{y}^A - \bar{y}^B$ is now calculated by

$$\bar{y}^A - \bar{y}^B = \hat{a}_A \bar{x}^A - \hat{a}_B \bar{x}^B + \hat{b}_A - \hat{b}_B \pm Q(1 - \alpha) \sqrt{\hat{\sigma}_A^2 + \hat{\sigma}_B^2}. \tag{13}$$
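
The comparison in Eq. (13) combines two Gaussian estimates by subtracting their centres and adding their variances, which presumes that the two estimation errors are independent. A minimal sketch, with the hypothetical helper `difference_interval` and made-up input values, could look like this:

```python
from math import sqrt
from statistics import NormalDist
from typing import Tuple


def difference_interval(
    est_a: float, std_a: float, est_b: float, std_b: float, alpha: float
) -> Tuple[float, float]:
    """Interval for est_a - est_b, assuming independent Gaussian errors."""
    q = NormalDist().inv_cdf(1.0 - alpha)  # Q(1 - alpha), as written in Eq. (13)
    centre = est_a - est_b
    half_width = q * sqrt(std_a ** 2 + std_b ** 2)
    return centre - half_width, centre + half_width


# Made-up mean-accuracy estimates for two systems, with their standard deviations.
low, high = difference_interval(0.820, 0.007, 0.776, 0.007, alpha=0.025)
significant = low > 0.0  # the interval excludes zero
```

When the resulting interval excludes zero, the difference between the two systems would be regarded as significant at the chosen level.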



3. EXPERIMENTS 

Here we summarize the main experiments, which estimate and compare the
performance of three pre-trained chord recognition systems: the HP
system [8], which is trained on the audio dataset used in the MIREX
Chord Detection task 2010; the labROSA system [2], which is trained on
the same dataset; and Chordino, a freely available pre-trained chord
recognition system. The reference system used to generate pseudo
annotations is the Jump Alignment (JA) method [6], which has been shown
to produce more accurate chord predictions than all other systems by
using the online chord database E-chords. The validation set consists of
175 Beatles songs, for which we have both ground truth and pseudo
annotations. This set is used to learn the parameters of the single
Gaussian (S), the individual Gaussian (I) and the linear regression (L)
models. The test set consists of 1840 songs from a variety of genres,
for which we can only derive pseudo annotations using JA. The objective
of the experiments is to estimate and compare the GT accuracies of the
three systems on the test set using the S, I and L models.

System               S model [%]   I model [%]   L model [%]   GT acc [%]
HP                   82.6 ± 1.3    82.2 ± 1.3    82.2 ± 1.3    82.2
labROSA              76.6 ± 1.3    77.6 ± 1.4    77.6 ± 1.3    77.6
Chordino             75.2 ± 1.3    76.2 ± 1.0    76.2 ± 1.0    76.2
Consensus            82.3 ± 1.3    82.7 ± 1.2    82.7 ± 1.2    82.7

System               S model [%]   I model [%]   L model [%]   GT acc [%]
HP - labROSA         6.6 ± 1.9     4.6 ± 1.9     4.6 ± 1.9     4.6
HP - Chordino        8.0 ± 1.7     6.0 ± 1.6     6.0 ± 1.6     6.0
HP - Consensus       0.3 ± 1.8     -0.5 ± 1.8    -0.5 ± 1.8    -0.5
labROSA - Chordino   1.3 ± 0.9     1.4 ± 1.7     1.4 ± 1.6     1.4

Table 1. Upper table: the estimated performance of HP, labROSA and
Chordino on the validation set. The first three columns are the mean GT
accuracies estimated with the S, I and L models respectively (Eq. (12)
for the L model), where the confidence level is fixed at 95%. The fourth
column is the real GT accuracy. Lower table: the comparison of
performance between test systems, using the S, I and L models (Eq. (13)
for the L model) and the real GT accuracy differences respectively.

3.1 Verification 

We first treated the validation set as the test set in order to verify
the confidence intervals estimated by the S, I and L models. The results
are shown in Table 1. All real GT accuracies in the upper table fall
within the estimated intervals at the 95% confidence level, verifying
the reliability of the models. We also observed from Table 1 (lower
table) that the S model is biased towards HP, as expected, because HP
shares the same chromagram features with the JA method: the S-model
estimates of HP - labROSA and HP - Chordino exceed the real differences.
This bias is removed by the I and L models.
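
The coverage check on the upper part of Table 1 can be reproduced with a short script. The numbers below are copied from Table 1; the data layout and the helper name `covered` are hypothetical.

```python
# (estimate, half-width) per model and the real GT accuracy, all in percent,
# copied from the upper part of Table 1.
table_1_upper = {
    "HP":        {"S": (82.6, 1.3), "I": (82.2, 1.3), "L": (82.2, 1.3), "GT": 82.2},
    "labROSA":   {"S": (76.6, 1.3), "I": (77.6, 1.4), "L": (77.6, 1.3), "GT": 77.6},
    "Chordino":  {"S": (75.2, 1.3), "I": (76.2, 1.0), "L": (76.2, 1.0), "GT": 76.2},
    "Consensus": {"S": (82.3, 1.3), "I": (82.7, 1.2), "L": (82.7, 1.2), "GT": 82.7},
}


def covered(estimate: float, half_width: float, truth: float) -> bool:
    """True if truth lies inside estimate +/- half_width."""
    return abs(estimate - truth) <= half_width


coverage = {
    (system, model): covered(*row[model], row["GT"])
    for system, row in table_1_upper.items()
    for model in ("S", "I", "L")
}
all_covered = all(coverage.values())
```

The same check applied to the lower part of the table would flag the S-model rows for HP - labROSA and HP - Chordino, whose intervals do not contain the real differences.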

3.2 Performance estimation 

We then estimated and compared the performance of the systems on the
large test set, for which ground truth annotations are not available.
Again, we first estimated the mean GT accuracies of the three systems
using the S/I/L models and the pseudo annotations generated by JA. The
results are illustrated in Figure 2. We observed that the estimated
accuracy intervals of labROSA and Chordino overlap heavily, indicating
similar performance of the two systems. In contrast, there is a large
gap between HP and the other two systems, implying the superiority of
the HP system. We also observed that the S model gave a higher estimate
than the I model for HP, unlike for the other systems. This implies a
bias towards HP, which was eliminated by the I/L models.

Figure 2. The estimated mean GT accuracies of the test systems on a
large test set (1840 songs).

Finally, we categorized the test songs by genre and estimated the mean
GT accuracies of the test systems on each genre. The results are
illustrated in Figure 3 to Figure 5. We observed that HP performs better
on most genres, especially on Rock-related genres. This is consistent
with the fact that the system is trained on songs mainly from the Rock
genre. Meanwhile, the performances of labROSA and Chordino overlap
heavily, except on some genres containing few songs, where the
difference may have arisen by chance.
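
A genre-wise estimate can be sketched by grouping the test songs by genre and shifting each genre's mean pseudo accuracy by a system-specific offset, as the individual Gaussian model of Section 2.2 suggests. The function name `estimate_by_genre`, the offset value and the toy songs are hypothetical; the song count is returned alongside each estimate because genres with few songs give less reliable means.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple


def estimate_by_genre(
    songs: Iterable[Tuple[str, float]], mu_hat: float
) -> Dict[str, Tuple[float, int]]:
    """Map genre -> (estimated mean GT accuracy, number of songs).

    songs: (genre, pseudo accuracy) pairs for one test system.
    mu_hat: that system's offset learnt on the validation set.
    """
    by_genre = defaultdict(list)
    for genre, pseudo_accuracy in songs:
        by_genre[genre].append(pseudo_accuracy)
    return {
        genre: (mean(values) + mu_hat, len(values))
        for genre, values in by_genre.items()
    }


# Toy test set for one system with a made-up offset of 0.02.
songs = [("Rock", 0.80), ("Rock", 0.70), ("Blues", 0.60)]
estimates = estimate_by_genre(songs, mu_hat=0.02)
```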

3.3 Consensus 

Figure 3. Estimated GT accuracies of the test systems on each genre, using the S model.
Figure 4. Estimated GT accuracies of the test systems on each genre, using the I model.
Figure 5. Estimated GT accuracies of the test systems on each genre, using the L model.

Inspired by the fact that the best test system, HP, does not always
outperform the other two in the genre-specific estimates, we tried to
combine the predictions of the three systems in order to improve the
recognition accuracy. As a trial, we simply combined the predictions on
each frame by majority vote, which yielded a consensus prediction of the
three systems (denoted by "Consensus"). The performance of the Consensus
system is presented in Table 1 as well as in Figure 2 to Figure 5. It is
promising to see that the consensus prediction performs slightly better
than HP (under the I/L models) by compensating for the low performance
of HP on certain genres (e.g. Funk and Blues). This observation is
sufficiently encouraging that combining systems' predictions using
machine learning techniques will be investigated in the future.
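
The frame-wise majority vote can be sketched as follows. The text does not say how ties are resolved when all three systems disagree; this sketch assumes that the label of the first-listed system wins, which is only one possible choice. The function name `majority_vote` and the toy label sequences are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[str]]) -> List[str]:
    """Combine per-frame chord labels from several systems by majority vote.

    Ties are broken in favour of the earliest system in `predictions`
    (an assumption of this sketch).
    """
    if not predictions:
        raise ValueError("at least one system is required")
    n_frames = len(predictions[0])
    if any(len(p) != n_frames for p in predictions):
        raise ValueError("all systems must label the same frames")
    consensus = []
    for frame_labels in zip(*predictions):
        counts = Counter(frame_labels)
        top = max(counts.values())
        # First label, in system order, that reaches the top count.
        consensus.append(next(lab for lab in frame_labels if counts[lab] == top))
    return consensus


# Toy predictions of three systems on four frames.
hp = ["C", "G", "Am", "F"]
labrosa = ["C", "G", "F", "Dm"]
chordino = ["C", "Em", "F", "G"]
consensus = majority_vote([hp, labrosa, chordino])
```

Listing the strongest system first makes the consensus fall back to that system whenever no two systems agree.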

4. CONCLUSIONS AND FUTURE WORK 

We have proposed a new method to evaluate chord recognition systems on
songs that do not have full annotations. The approach goes beyond
existing evaluation metrics, allowing extensive analysis of chord
recognition systems, such as how well they generalize to different
genres. In the experiments, we tested this method on three systems, and
the resulting confidence intervals on a validation set verified its
reliability. We then evaluated these systems on a much larger test set
and obtained some promising observations that cannot be obtained with
current evaluation techniques. These observations inspired us to combine
the predictions of different systems, and the resulting consensus system
achieved the best performance by compensating for the weaknesses of
individual systems.

In future work, we aim to improve the reliability of the proposed
statistical models. Since there may be errors and omissions in chord
sequences obtained from the online databases, these chord sequences may
become outliers in the validation and test sets (e.g. the circled points
in Figure 1). A method to detect and remove these outliers is therefore
one direction of our future work. Meanwhile, as pointed out in
Section 3.3, combining systems' predictions using machine learning
techniques will also be investigated.